Disambiguation of Proper Names in Text
Abstract
Identifying the occurrences of proper names in text and the entities they refer to can be a difficult task because of the many-to-many mapping between names and their referents. We analyze the types of ambiguity (structural and semantic) that make the discovery of proper names in text difficult, and describe the heuristics used to disambiguate names in Nominator, a fully implemented module for proper name recognition developed at the IBM T.J. Watson Research Center.

NOTE: This is a preprint of a paper to be published in the Proceedings of the 5th Applied Natural Language Processing Conference, March 31 to April 3, 1997, Washington, D.C.

1 Proper Name Identification in Natural Language Processing

Text processing applications, such as machine translation systems, information retrieval systems or natural-language understanding systems, need to identify multi-word expressions that refer to proper names of people, organizations, places, laws and other entities. When encountering Mrs. Candy Hill in input text, for example, a machine translation system should not attempt to look up the translation of candy and hill, but should translate Mrs. to the appropriate personal title in the target language and preserve the rest of the name intact. Similarly, an information retrieval system should not attempt to expand Candy to all of its morphological variants or suggest synonyms (Wacholder et al. 1994).

The need to identify proper names has two aspects: the recognition of known names and the discovery of new names. Since obtaining and maintaining a name database requires significant effort, many applications need to operate in the absence of such a resource. Without a database, names need to be discovered in the text and linked to the entities they refer to.
Even where name databases exist, text needs to be scanned for new names that are formed when entities, such as countries or commercial companies, are created, or for unknown names which become important when the entities they refer to become topical. This situation is the norm for dynamic applications such as news providing services or Internet information indexing.

The next section describes the different types of proper name ambiguities we have observed. Section 3 discusses the role of context and world knowledge in their disambiguation; Section 4 describes the process of name discovery as implemented in Nominator, a module for proper name recognition developed at the IBM T.J. Watson Research Center. Sections 5-7 elaborate on Nominator's disambiguation heuristics.

2 The Ambiguity of Proper Names

Name identification requires resolution of a subset of the types of structural and semantic ambiguities encountered in the analysis of nouns and noun phrases (NPs) in natural language processing. Like common nouns (see (Jensen and Binot 1987), (Hindle and Rooth 1993) and (Brill and Resnick 1994)), proper names exhibit structural ambiguity in prepositional phrase (PP) attachment and in conjunction scope.

A PP may be attached to the preceding NP and form part of a single large name, as in NP[Midwest Center PP[for NP[Computer Research]]]. Alternatively, it may be independent of the preceding NP, as in NP[Carnegie Hall] PP[for NP[Irwin Berlin]], where for separates two distinct names, Carnegie Hall and Irwin Berlin. As with PP-attachment of common noun phrases, the ambiguity is not always resolved, even in human sentence parsing (cf. the famous example I saw the girl in the park with the telescope). The location of an organization, for instance, could be part of its name (City University of New York) or an attached modifier (The Museum of Modern Art in New York City). Without knowledge of the official name, it is sometimes difficult to determine the exact boundaries of a proper name.
Consider examples such as Western Co. of North America, Commodity Exchange in New York and Hebrew University in Jerusalem, Israel.

Proper names contain ambiguous conjoined phrases. The components of Victoria and Albert Museum and IBM and Bell Laboratories look identical; however, and is part of the name of the museum in the first example, but a conjunction joining two computer company names in the second. Although this problem is well known, a search of the computational literature shows that few solutions have been proposed, perhaps because the conjunct ambiguity problem is harder than PP attachment (though see (Agarwal and Boggess 1992) for a method of conjunct identification that relies on syntactic category and semantic label).

Similar structural ambiguity exists with respect to the possessive pronoun, which may indicate a relationship between two names (e.g., Israel's Shimon Peres) or may constitute a component of a single name (e.g., Donoghue's Money Fund Report).

The resolution of structural ambiguity such as PP attachment and conjunction scope is required in order to automatically establish the exact boundaries of proper names. Once these boundaries have been established, there is another type of well-known structural ambiguity, involving the internal structure of the proper name. For example, Professor of Far Eastern Art John Blake is parsed as [[Professor [of Far Eastern Art]] John Blake] whereas Professor Art Klein is [[Professor] Art Klein].

Proper names also display semantic ambiguity. Identification of the type of proper nouns resembles the problem of sense disambiguation for common nouns where, for instance, state taken out of context may refer either to a government body or the condition of a person or entity. A name variant taken out of context may be one of many types, e.g., Ford by itself could be a person (Gerald Ford), an organization (Ford Motors), a make of car (Ford), or a place (Ford, Michigan).
Entity-type ambiguity is quite common, as places are named after famous people and companies are named after their owners or locations. In addition, naming conventions are sometimes disregarded by people who enjoy creating novel and unconventional names. A store named Mr. Tall and a woman named April Wednesday (McDonald 1993) come to mind.

Like common nouns, proper nouns exhibit systematic metonymy: United States refers either to a geographical area or to the political body which governs this area; Wall Street Journal refers to the printed object, its content, and the commercial entity that produces it.

In addition, proper names resemble definite noun phrases in that their intended referent may be ambiguous. The man may refer to more than one male individual previously mentioned in the discourse or present in the non-linguistic context; J. Smith may similarly refer to more than one individual named Joseph Smith, John Smith, Jane Smith, etc. Semantic ambiguity of names is very common because of the standard practice of using shorter names to stand for longer ones. Shared knowledge and context are crucial disambiguation factors. Paris usually refers to the capital of France, rather than a city in Texas or the Trojan prince, but in a particular context, such as a discussion of Greek mythology, the presumed referent changes.

Beyond the ambiguities that proper names share with common nouns, some ambiguities are particular to names: noun phrases may be ambiguous between a name reading and a common noun phrase, as in Candy, the person's name, versus candy the food, or The House as an organization versus a house referring to a building. In English, capitalization usually disambiguates the two, though not at sentence beginnings: at the beginning of a sentence, the components and capitalization patterns of New Coke and New Sears are identical; only world knowledge informs us that New Coke is a product and Sears is a company.
Furthermore, capitalization does not always disambiguate names from non-names because what constitutes a name as opposed to a non-name is not always clear. According to (Quirk et al. 1972), names, which consist of proper nouns (classified into personal names like Shakespeare, temporal names like Monday, or geographical names like Australia), have 'unique' reference. Proper nouns differ in their linguistic behavior from common nouns in that they mostly do not take determiners or have a plural form. However, some names do take determiners, as in The New York Times; in this case, they "are perfectly regular in taking the definite article since they are basically premodified count nouns... The difference between an ordinary common noun and an ordinary common noun turned name is that the unique reference of the name has been institutionalized, as is made overt in writing by initial capital letter." Quirk et al.'s description of names seems to indicate that capitalized words like Egyptian (an adjective) or Frenchmen (a noun referring to a set of individuals) are not names. It leaves capitalized sequences like Minimum Alternative Tax, Annual Report, and Chairman undetermined as to whether or not they are names.

All of these ambiguities must be dealt with if proper names are to be identified correctly. In the rest of the paper we describe the resources and heuristics we have designed and implemented in Nominator and the extent to which they resolve these ambiguities.

3 Disambiguation Resources

In general, two types of resources are available for disambiguation: context and world knowledge. Each of these can be exploited along a continuum, from 'cheaper' to computationally and manually more expensive usage. 'Cheaper' models, which include no context or world knowledge, do very little disambiguation.
More 'expensive' models, which use full syntactic parsing, discourse models, inference and reasoning, require computational and human resources that may not always be available, as when massive amounts of text have to be rapidly processed on a regular basis. In addition, given the current state of the art, full parsing and extensive world knowledge would still not yield complete automatic ambiguity resolution.

In designing Nominator, we have tried to achieve a balance between high accuracy and speed by adopting a model which uses minimal context and world knowledge. Nominator uses no syntactic contextual information. It applies a set of heuristics to a list of (multi-word) strings, based on patterns of capitalization, punctuation and location within the sentence and the document. This design choice differentiates our approach from that of several similar projects. Most proper name recognizers that have been reported on in print either take as input text tagged by part-of-speech (e.g., the systems of (Paik et al. 1993) and (Mani et al. 1993)) or perform syntactic and/or morphological analysis on all words, including capitalized ones, that are part of candidate proper names (e.g., (Coates-Stephens 1993) and (McDonald 1993)). Several (e.g., (McDonald 1993), (Mani et al. 1993), (Paik et al. 1993) and (Cowie et al. 1992)) look in the local context of the candidate proper name for external information such as appositives (e.g., in a sequence such as Robin Clark, president of Clark Co.) or for human-subject verbs (e.g., say, plan) in order to determine the category of the candidate proper name. Nominator does not use this type of external context.

Instead, Nominator makes use of a different kind of contextual information: proper names co-occurring in the document. It is a fairly standard convention in an edited document for one of the first references to an entity (excluding a reference in the title) to include a relatively full form of its name.
In a kind of discourse anaphora, other references to the entity take the form of shorter, more ambiguous variants. Nominator identifies the referent of the full form (see below) and then takes advantage of the discourse context provided by the list of names to associate shorter, more ambiguous name occurrences with their intended referents.

In terms of world knowledge, the most obvious resource is a database of known names. In fact, this is what many commercially available name identification applications use (e.g., Hayes 1994). A reliable database provides both accuracy and efficiency, if fast look-up methods are incorporated. A database also has the potential to resolve structural ambiguity; for example, if IBM and Apple Computers are listed individually in the database but IBM and Apple Computers is not, it may indicate a conjunction of two distinct names. A database may also contain default world knowledge information: e.g., with no other over-riding information, it may be safe to assume that the string McDonald's refers to an organization. But even if an existing database is reliable, names that are not yet in it must be discovered and information in the database must be over-ridden when appropriate. For example, if a new name such as IBM Credit Corp. occurs in the text but not in the database, while IBM exists in the database, automatic identification of IBM should be blocked in favor of the new name IBM Credit Corp.

If a name database exists, Nominator can take advantage of it. However, our goal has been to design Nominator to function optimally in the absence of such a resource. In this case, Nominator consults a small authority file which contains information on about 3000 special 'name words' and their relevant lexical features. Listed are personal titles (e.g., Mr., King), organizational identifiers (including strong identifiers such as Inc.
and weaker domain identifiers such as Arts) and names of large places (e.g., Los Angeles, California, but not Scarsdale, N.Y.). Also listed are exception words, such as upper-case lexical items that are unlikely to be single-word proper names (e.g., Very, I or TV) and lower-case lexical items (e.g., and and van) that can be parts of proper names. In addition, the authority file contains about 20,000 first names.

Our choice of disambiguation resources makes Nominator fast and robust. The precision and recall of Nominator, operating without a database of pre-existing proper names, are in the 90s (percent), while the processing rate is over 40 megabytes of text per hour on a RISC/6000 machine. (See (Ravin and Wacholder 1996) for details.)

This efficient processing has been achieved at the cost of limiting the extent to which the program can 'understand' the text being analyzed and resolve potential ambiguity. Many word sequences that are easily recognized by human readers as names are ambiguous for Nominator, given the restricted set of tools available to it. In cases where Nominator cannot resolve an ambiguity with relatively high confidence, we follow the principle that 'noisy information' is to be preferred to data omitted, so that no information is lost. In ambiguous cases, the module is designed to make conservative decisions, such as including non-names or non-name parts in otherwise valid name sequences. It assigns weak types such as ?HUMAN or fails to assign a type if the available information is not sufficient.

4 The Name Discovery Process

In this section, we give an overview of the process by which Nominator identifies and classifies proper names. Nominator's first step is to build a list of candidate names for a document. Next, 'splitting' heuristics are applied to all candidate names for the purpose of breaking up complex names into smaller ones. Finally, Nominator groups together name variants that refer to the same entity.
After information about names and their referents has been extracted from individual documents, an aggregation process combines the names collected from all the documents into a dictionary, or database of names, representative of the document collection. (For more details on the process, see (Ravin and Wacholder 1996).)

We illustrate the process of name discovery with an excerpt taken from a Wall Street Journal article in the TIPSTER CD-ROM collection (NIST 1993). Paragraph breaks are omitted to conserve space.

... The professional conduct of lawyers in other jurisdictions is guided by American Bar Association rules or by state bar ethics codes, none of which permit non-lawyers to be partners in law firms. The ABA has steadfastly reserved the title of partner and partnership perks (which include getting a stake of the firm's profit) for those with law degrees. But Robert Jordan, a partner at Steptoe & Johnson who took the lead in drafting the new district bar code, said the ABA's rules were viewed as "too restrictive" by lawyers here. "The practice of law in Washington is very different from what it is in Dubuque," he said. ... Some of these non-lawyer employees are paid at partners' levels. Yet, not having the partner title "makes non-lawyers working in law firms second-class citizens," said Mr. Jordan of Steptoe & Johnson. ...

Before the text is processed by Nominator, it is analyzed into tokens: sentences, words, tags, and punctuation elements. Nominator forms a candidate name list by scanning the tokenized document and collecting sequences of capitalized tokens (or words) as well as some special lower-case tokens, such as conjunctions and prepositions. The list of candidate names extracted from the sample document contains:

American Bar Association
Robert Jordan
Steptoe & Johnson
ABA
Washington
Dubuque
Mr. Jordan of Steptoe & Johnson

Each candidate name is examined for the presence of conjunctions, prepositions or possessive 's.
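The candidate-collection step described above can be illustrated with a minimal Python sketch. It is not Nominator's implementation: the function name `candidate_names`, the helper `is_capitalized`, and the small `CONNECTORS` set (standing in for the special lower-case tokens) are all assumptions made for illustration. The sketch takes already tokenized input and ignores complications such as capitalized sentence-initial words and the exception words in the authority file.

```python
# Hypothetical stand-in for the "special lower-case tokens"; the contents
# of Nominator's actual list are not given in the text.
CONNECTORS = {"and", "of", "for", "in", "&"}


def is_capitalized(token: str) -> bool:
    """True if the token starts with an upper-case letter."""
    return token[:1].isupper()


def candidate_names(tokens: list[str]) -> list[str]:
    """Collect maximal runs of capitalized tokens as candidate names.

    A connector joins a run only when a run is already open and the
    following token is capitalized, so runs never end in a connector.
    """
    names, run = [], []
    for i, tok in enumerate(tokens):
        nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
        if is_capitalized(tok):
            run.append(tok)
        elif run and tok.lower() in CONNECTORS and is_capitalized(nxt):
            run.append(tok)
        else:
            # Any other token closes the current run, if one is open.
            if run:
                names.append(" ".join(run))
            run = []
    if run:
        names.append(" ".join(run))
    return names


if __name__ == "__main__":
    tokens = ["said", "Mr.", "Jordan", "of", "Steptoe", "&", "Johnson", "."]
    print(candidate_names(tokens))
```

On the last sentence of the excerpt this yields the single long candidate Mr. Jordan of Steptoe & Johnson, which is why a later splitting step is needed.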
A set of heuristics is applied to determine whether each candidate name should be split into smaller independent names. For example, Mr. Jordan of Steptoe & Johnson is split into Mr. Jordan and Steptoe & Johnson.

Finally, Nominator links together variants that refer to the same entity. Because of standard English-language naming conventions, Mr. Jordan is grouped with Robert Jordan. ABA is grouped with American Bar Association as a possible abbreviation of the longer name. Each linked group is categorized by an entity type and assigned a 'canonical name' as its identifier. The canonical name is the fullest, least ambiguous label that can be used to refer to the entity. It may be one of the variants found in the document or it may be constructed from components of different ones. As the links are formed, each group is assigned a type. In the sample output shown below, each canonical name is followed by its entity type and by the variants linked to it.

American Bar Association (ORG) : ABA
Steptoe & Johnson (ORG)
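The two linking conventions mentioned above (an abbreviation matching the initials of a longer name, and a title plus surname matching a fuller personal name) can be sketched in Python as follows. This is an illustrative assumption, not Nominator's algorithm: `TITLES`, `acronym` and `link_variants` are hypothetical names, the sketch simply takes the longest variant as the canonical name rather than constructing one from components, and it assigns no entity types.

```python
# Hypothetical subset of personal titles; the real authority file is larger.
TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}


def acronym(name: str) -> str:
    """Initial letters of the capitalized words in a name."""
    return "".join(w[0] for w in name.split() if w[0].isupper())


def link_variants(names: list[str]) -> dict[str, list[str]]:
    """Map each canonical name to the shorter variants linked to it."""

    def sort_key(name: str):
        words = name.split()
        # Untitled names first, then longer names first, so that full
        # forms become canonical before their variants are examined.
        return (words[0] in TITLES, -len(words), name)

    groups: dict[str, list[str]] = {}
    for name in sorted(set(names), key=sort_key):
        words = name.split()
        target = None
        for canon in groups:
            if name == acronym(canon):
                target = canon  # e.g. ABA -> American Bar Association
            elif words[0] in TITLES and words[-1] == canon.split()[-1]:
                # Title + surname, e.g. Mr. Jordan -> Robert Jordan.
                # Without entity types this could wrongly match an
                # organization that ends in the same surname.
                target = canon
            if target:
                break
        if target is None:
            groups[name] = []
        else:
            groups[target].append(name)
    return groups


if __name__ == "__main__":
    found = ["American Bar Association", "Robert Jordan",
             "Steptoe & Johnson", "ABA", "Mr. Jordan"]
    for canon, variants in link_variants(found).items():
        print(canon, ":", ", ".join(variants))
```

Processing the fuller forms first mirrors the observation that an edited document usually introduces an entity with a relatively full name before using shorter variants.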
Publication date: 1997